Computational Lab: 2-12-10
- Start with the files acghTable.dat and probeList.dat. Using the probeIDs as unique identifiers, match each probe's value to its position and transform the data for sample "0001" into lff format.
- Take the scores from column "0001" and insert them into the 'score' field of your lff.
- Filter out any probes with no value (or "NA").
- Use "+" for each strand value.
- The phase, qstart, and qstop fields should be filled in with dots (".").
- Create an attribute-value pair in the 13th column called "sample" with the value "0001".
- Deliverables: the Ruby script you use to create the lff file.
- Upload this data into Genboree (you may either use the API or upload it manually). Then, use the Segmentation tool (under Tools > Plugins) to select regions of high copy-number variance. Require that each segment contain at least 3 probes, and that the score of each segment exceed two standard deviations from the mean.
- Combine the resulting track with the segmented data from the other 185 tumors (all185.acgh.lff.gz) and select out only those segments that represent gains on chromosome 12, using the Annotation selector tool.
- Note: you don't have to unzip the file before uploading to Genboree.
- Now, upload the file refSeq.blocked.noSplice.lff.gz to your database. It will create a track called "RefSeq:Blocked". This is the RefSeq genes track with intronic sequences treated as part of the gene, and all of the splice variants removed. Use the Attribute Lifter tool in Genboree to lift in the sample names from the chromosome 12 gains that hit these genes.
- Click on the track name and use the tabular view to create a table with two columns: the gene name, and a comma-separated list of matching samples. Download this table, then write a small Ruby script that parses it and outputs only the few genes that are altered in more than 20 samples.
- Note: the "numIntersects" field is not a reliable indicator of how many samples match, only of how many distinct annotations match. You'll have to write a script to count the entries in the samples attribute.
- Deliverables: the Ruby script that parses your tabular output.
- Note: Step 6 is no longer required. Do up to step 5, then submit a list containing two columns: The first column containing the genes on chromosome 12 that have gains in more than 20 samples, and the second column the exact number of samples that they're altered in.
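For the first step (matching probe values to positions and writing lff), a minimal sketch might look like the following. Several things here are assumptions, not facts from the assignment: that probeList.dat is tab-delimited as probeID, chromosome, start, stop; that acghTable.dat is tab-delimited with a header row whose first column is the probe ID and whose remaining columns are sample names; that the lff column order is class, name, type, subtype, chromosome, start, stop, strand, phase, score, qstart, qstop, attributes; and that the class/type/subtype strings ("aCGH", "probe") are placeholders you would replace. Check your actual files before relying on any of it.

```ruby
# Sketch only. The file layouts below are ASSUMED, not given by the assignment:
#   probeList.dat : probeID <TAB> chromosome <TAB> start <TAB> stop
#   acghTable.dat : header "probeID <TAB> 0001 <TAB> ...", then one row per probe

# Build a hash of probeID => [chromosome, start, stop].
def read_positions(lines)
  positions = {}
  lines.each do |line|
    id, chrom, start, stop = line.chomp.split("\t")
    next if id.nil? || id == "probeID" # skip blank lines and an (assumed) header row
    positions[id] = [chrom, start.to_i, stop.to_i]
  end
  positions
end

# Return an array of lff lines for one sample column.
def to_lff(table_lines, positions, sample)
  rows = table_lines.map { |l| l.chomp.split("\t", -1) }
  header = rows.shift
  raise "empty table" if header.nil?
  col = header.index(sample) or raise "no column named #{sample}"

  rows.each_with_object([]) do |fields, out|
    id = fields[0]
    value = fields[col]
    # Filter out probes with no value or "NA".
    next if value.nil? || value.strip.empty? || value.strip == "NA"
    # Skip probes that have no known position.
    next unless positions.key?(id)

    chrom, start, stop = positions[id]
    out << [
      "aCGH", id, "aCGH", "probe",  # class, name, type, subtype (placeholder names)
      chrom, start, stop,
      "+",                          # strand
      ".",                          # phase
      value.strip,                  # score
      ".", ".",                     # qstart, qstop
      "sample=#{sample}; "          # 13th column: attribute-value pair
    ].join("\t")
  end
end

# Runs only when two file names are given on the command line.
if __FILE__ == $PROGRAM_NAME && ARGV.size == 2
  positions = read_positions(File.readlines(ARGV[0]))
  puts to_lff(File.readlines(ARGV[1]), positions, "0001")
end
```

Saved under a hypothetical name such as make_lff.rb, this could be run as `ruby make_lff.rb probeList.dat acghTable.dat > sample0001.lff`. Keeping the conversion in functions that take arrays of lines makes it easy to try on a few hand-written rows before running it on the real files.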
All deliverables:
- Ruby script from step 1 that creates your lff file
- Ruby script from step 5 that parses your table
- two-column output from step 5
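The table-parsing step can be sketched as below. This assumes the downloaded table is tab-delimited with the gene name in the first column and the comma-separated sample list in the second; the function name and the decision to count each sample only once per gene are choices made for this sketch, not requirements stated in the assignment. It counts entries in the samples attribute directly rather than trusting "numIntersects".

```ruby
# Sketch only. ASSUMED layout of the downloaded table:
#   gene name <TAB> comma-separated list of samples

# Return a hash of gene => number of distinct samples, keeping only genes
# altered in more than `threshold` samples.
def genes_over_threshold(lines, threshold = 20)
  counts = {}
  lines.each do |line|
    gene, samples = line.chomp.split("\t", 2)
    next if gene.nil? || samples.nil?

    # Count distinct sample names; a sample listed twice is counted once
    # (an assumption of this sketch). Stray quotes and blanks are dropped.
    n = samples.delete('"').split(",").map(&:strip).reject(&:empty?).uniq.size

    # A header row has a single "sample" entry, so it never passes the filter.
    counts[gene] = n if n > threshold
  end
  counts
end

# Runs only when a file name is given on the command line.
if __FILE__ == $PROGRAM_NAME && ARGV.size == 1
  genes_over_threshold(File.readlines(ARGV[0])).each do |gene, n|
    puts "#{gene}\t#{n}"
  end
end
```

Run as, for example, `ruby count_samples.rb table.txt > genes.txt` (both file names hypothetical), this prints the two-column list of genes and sample counts. If the same gene name appears on more than one row of your table, the later row overwrites the earlier one, so you may need to merge rows first.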
This assignment will be due before the review session, on Feb 24, 2010.
Zip the files up, title the zip with your name, and send them to chrisamiller@gmail.com.
Feel free to contact me if you're having any problems. Email is usually the best way, and I'll almost always respond within an hour or two. We can also arrange a meeting - email me and we'll work out the details.
I'll look over early submissions and if there are major problems, I'll return them to you and give you a chance to resubmit. Assignments completed closer to the due date may not get this opportunity.